8 research outputs found

    On the dynamics of interdomain routing in the Internet

    The routes used in the Internet's interdomain routing system are a rich information source that could be exploited to answer a wide range of questions. However, analyzing routes is difficult, because the fundamental object of study is a set of paths. In this dissertation, we present new analysis tools -- metrics and methods -- for analyzing paths, and apply them to study interdomain routing in the Internet over long periods of time. Our contributions are threefold. First, we build on an existing metric (Routing State Distance) to define a new metric that measures the similarity between two prefixes with respect to the state of the global routing system. Applying this metric over time yields a measure of how the set of paths to each prefix varies at a given timescale. Second, we present PathMiner, a system that extracts large-scale routing events from background noise and identifies the AS (Autonomous System) or AS-link most likely responsible for each event. PathMiner is distinguished from previous work by its ability to identify and analyze large-scale events that may recur many times over long timescales. We show that it is scalable, extracting significant events from multiple years of routing data at a daily granularity. Finally, we equip Routing State Distance with a new set of tools for identifying and characterizing unusually routed ASes. At the micro level, we use these tools to identify the clusters of ASes with the most unusual routing at each point in time. We also show that analysis of individual ASes can expose the business and engineering strategies of the organizations that own them; these strategies are often related to content delivery or service replication. At the macro level, we show that the set of ASes with the most unusual routing defines discernible and interpretable phases of the Internet's evolution. Furthermore, we show that our tools provide a quantitative measure of the "flattening" of the Internet.
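    The core idea of a Routing State Distance-style comparison can be sketched as follows. This is a minimal illustration, assuming the routing state is represented as a table mapping each vantage point to the next-hop AS it uses toward a prefix; the names and data are illustrative, not the dissertation's actual implementation.

```python
# A minimal sketch of a Routing State Distance (RSD)-style measure:
# count the vantage points that route differently toward two prefixes.
# Representation (vantage point -> next-hop AS) is an assumption.

def rsd(routes_a: dict, routes_b: dict) -> int:
    """Number of shared vantage points whose next hop toward
    prefix A differs from their next hop toward prefix B."""
    vantage_points = routes_a.keys() & routes_b.keys()
    return sum(1 for v in vantage_points if routes_a[v] != routes_b[v])

# Two prefixes routed identically from every shared vantage point
# have distance 0; the distance grows as their routing diverges.
a = {"vp1": "AS100", "vp2": "AS200", "vp3": "AS300"}
b = {"vp1": "AS100", "vp2": "AS999", "vp3": "AS300"}
print(rsd(a, b))  # 1 (only vp2 routes differently)
```

    Tracking such a distance over time, as the abstract describes, then gives a per-prefix signal of how routing state changes at a chosen timescale.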

    Evaluating LLP Methods: Challenges and Approaches

    Learning from Label Proportions (LLP) is an established machine learning problem with numerous real-world applications. In this setting, data items are grouped into bags, and the goal is to learn individual item labels, knowing only the features of the data and the proportions of labels in each bag. Although LLP is a well-established problem, it has several unusual aspects that create challenges for benchmarking learning methods. Fundamental complications arise from the existence of different LLP variants, i.e., dependence structures that can exist between items, labels, and bags. Accordingly, the first algorithmic challenge is the generation of variant-specific datasets capturing the diversity of dependence structures and bag characteristics. The second methodological challenge is model selection, i.e., hyperparameter tuning; due to the nature of LLP, model selection cannot easily use the standard machine learning paradigm. The final benchmarking challenge consists of properly evaluating LLP solution methods across various LLP variants. We note that there is very little consideration of these issues in prior work, and no general solutions for these challenges have been proposed to date. To address these challenges, we develop methods capable of generating LLP datasets meeting the requirements of different variants. We use these methods to generate a collection of datasets encompassing the spectrum of LLP problem characteristics, which can be used in future evaluation studies. Additionally, we develop guidelines for benchmarking LLP algorithms, including the model selection and evaluation steps. Finally, we illustrate the new methods and guidelines by performing an extensive benchmark of a set of well-known LLP algorithms. We show that the choice of the best algorithm depends critically on the LLP variant and the model selection method, demonstrating the need for our proposed approach.
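    The LLP setting described above can be made concrete with a toy generator. This is a minimal sketch under illustrative assumptions (binary labels, one scalar feature per item, independent bags); it is not one of the paper's variant-specific generation methods.

```python
# A toy illustration of the LLP setting: items are grouped into bags,
# and the learner sees only item features plus each bag's label
# proportion -- never the individual hidden labels.
import random

random.seed(0)

def make_llp_bags(n_bags: int, bag_size: int):
    """Generate toy bags of (feature, hidden_label) items together
    with the per-bag positive-label proportion."""
    bags = []
    for _ in range(n_bags):
        items = []
        for _ in range(bag_size):
            label = random.randint(0, 1)
            # the feature is correlated with the hidden label plus noise
            feature = label + random.gauss(0, 0.5)
            items.append((feature, label))
        proportion = sum(lbl for _, lbl in items) / bag_size
        bags.append((items, proportion))
    return bags

bags = make_llp_bags(n_bags=5, bag_size=10)
for items, p in bags:
    features = [f for f, _ in items]  # visible to the learner
    # p, the bag's label proportion, is the only supervision signal
    assert 0.0 <= p <= 1.0
```

    The variants the abstract refers to differ in exactly the dependence structure this sketch treats as trivial, e.g. whether bag membership depends on the labels or features.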

    Tracking Knowledge Propagation Across Wikipedia Languages

    In this paper, we present a dataset of inter-language knowledge propagation in Wikipedia. Covering all 309 language editions and 33M articles, the dataset aims to track the full propagation history of Wikipedia concepts and to enable follow-up research on building predictive models of propagation. For this purpose, we align all Wikipedia articles in a language-agnostic manner according to the concept they cover, which results in 13M propagation instances. To the best of our knowledge, this dataset is the first to explore full inter-language propagation at a large scale. Together with the dataset, we provide a holistic overview of the propagation and key insights about the underlying structural factors to aid future research. For example, we find that although long cascades are unusual, propagation tends to continue further once it reaches more than four language editions. We also find that the size of a language edition is associated with the speed of propagation. We believe the dataset not only contributes to the prior literature on Wikipedia growth but also enables new use cases such as edit recommendation for addressing knowledge gaps, detection of disinformation, and cultural relationship analysis.
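    The cascade view underlying these findings can be sketched as follows. This is a minimal illustration, assuming each concept's propagation history is the ordered list of language editions that created an article for it; the data below is invented for illustration, not drawn from the dataset.

```python
# A sketch of propagation cascades: one ordered list of language
# editions per concept, from which cascade lengths are tallied.
from collections import Counter

propagation = {
    "concept_A": ["en", "de", "fr"],
    "concept_B": ["en"],
    "concept_C": ["ja", "en", "zh", "ko", "de", "fr"],
}

# distribution of cascade lengths across concepts
cascade_lengths = Counter(len(editions) for editions in propagation.values())

# the abstract's observation: long cascades are rare, but once a concept
# reaches more than four editions it tends to keep spreading
long_cascades = sum(1 for e in propagation.values() if len(e) > 4)
print(cascade_lengths, long_cascades)
```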

    An analysis of factors that influence interactions among Twitter users

    In information networks where users send messages to one another, the issue of information overload naturally arises: which are the most important messages? In this work we study the problem of understanding the importance of messages on Twitter. We approach this problem in two stages. First, we perform an extensive characterization of a very large Twitter data set that includes all users, social relations, and messages posted from the beginning of the service up to August 2009. We show evidence that information overload is present: users sometimes have to search through hundreds of messages to find those that are interesting to reply to or retweet. We then identify factors that influence user reply or retweet probability: previous replies to the same tweeter, the tweeter's sending rate, the age of the tweet, and some basic text elements of the tweet. In our second stage, we show that some of these factors can be used to improve the ordering of tweets as presented to the user. First, by inspecting user activity over time, we construct a simple on-off model of user behavior that allows us to infer when a user is actively using Twitter. Then, we explore two machine learning methods for ranking tweets: a Naive Bayes predictor and a Support Vector Machine classifier. We show that it is possible to reorder tweets to increase the fraction of replied or retweeted messages appearing in the first positions of the list by as much as 60%.
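    The reranking idea can be sketched as follows. This is an illustrative stand-in that scores tweets with hand-picked weights over the kinds of features the abstract names (prior replies to the same tweeter, the tweeter's sending rate, tweet age); it is not the paper's trained Naive Bayes or SVM model, and all names and weights are assumptions.

```python
# A hedged sketch of timeline reranking: score each tweet by features
# associated with reply/retweet probability and present high scores first.

def score(tweet: dict) -> float:
    """Higher score = assumed more likely to be replied to or retweeted."""
    s = 0.0
    s += 2.0 * tweet["prior_replies_to_tweeter"]  # past interaction helps
    s -= 0.5 * tweet["tweeter_send_rate"]         # penalize high-volume senders
    s -= 0.1 * tweet["age_hours"]                 # fresher is better
    return s

timeline = [
    {"id": 1, "prior_replies_to_tweeter": 0, "tweeter_send_rate": 9.0, "age_hours": 5},
    {"id": 2, "prior_replies_to_tweeter": 3, "tweeter_send_rate": 1.0, "age_hours": 1},
]
reranked = sorted(timeline, key=score, reverse=True)
print([t["id"] for t in reranked])  # [2, 1]: the tweet from a familiar,
                                    # low-volume tweeter surfaces first
```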

    Politics and disinformation: Analyzing the use of Telegram's information disorder network in Brazil for political mobilization

    Over the past few years, as network communication supported by social platforms and messengers has increasingly displaced traditional mass communication, political campaigns have come to rely on new tools and methods, including the use of these structures to promote an environment of information disorder for the purpose of mobilization. This work followed the use of Telegram as a tool for political mobilization in Brazil, collecting data from a dense information disorder network used to mobilize voters in support of then-president Jair Bolsonaro on 7 September (Independence Day in Brazil) in 2021 and 2022. The results show that engagement declined, mainly due to the lack of support from certain groups, such as anti-vaccination advocates and truck drivers. There was also a decrease in the extremism of discussion themes and lower levels of user activity.